Reconstruction of protein structures from a vectorial representation 
We show that the contact map of the native structure of globular proteins can be reconstructed
starting from the sole knowledge of the contact map's principal eigenvector, and present an exact 
algorithm for this purpose. Our algorithm yields a unique contact map for all 221 globular structures 
of PDBselect25 of length $N \leq 120$. We also show that the reconstructed contact maps allow in
turn for the accurate reconstruction of the three-dimensional structure. These results indicate that 
the reduced vectorial representation provided by the principal eigenvector of the contact map is 
equivalent to the protein structure itself. This representation is expected to provide a useful tool 
in bioinformatics algorithms for protein structure comparison and alignment, as well as a promising 
intermediate step towards protein structure prediction. 



PACS numbers: 87.14.Ee, 87.15.Cc, 87.15.Aa
Introduction. - Despite several decades of intense research,
the reliable prediction of the native state of a protein
from its sequence of amino acids is still a formidable
challenge [1]. Every two years the state of the art is
assessed by the CASP experiment [2]. In the most detailed
predictions, the list of the Cartesian coordinates of all
the atoms of the protein molecule is provided. Since
the structure of a protein can also be represented as a
contact map [3], lower-resolution predictions can be limited
to the determination of inter-residue contacts [4].

Ideally, one would like to predict the structure of a pro- 
tein using the representation that is encoded in the most 
straightforward way in the sequence of amino acids. Due 
to the vectorial nature of the sequence, one may guess 
that the simplest representation to predict should be vec- 
torial as well. In this Letter, we show that the principal 
eigenvector (PE) of the contact map (CM) of the na- 
tive structure is equivalent to the CM itself, and there- 
fore provides a very promising vectorial representation of 
protein structures. This vectorial representation is also 
expected to improve bioinformatics algorithms for protein
structure alignment as well as for the alignment of a protein
sequence with a database of structures (fold recognition).

The PE of the CM has already been used as an indicator
of protein topology, in particular as a means of
identifying structural domains [5] and clusters of amino
acids with special structural significance [6]. Here, we
present an exact algorithm to reconstruct a CM from the
knowledge of its PE (cf. Fig. 1). This step is analogous
to that of reconstructing the three-dimensional protein
structure based on the CM of the structure [7]. For the
proteins that we studied, the PE is sufficient to
reconstruct a CM uniquely. This means that the information
about all other eigenvectors and eigenvalues of a CM is
contained in the PE, and therefore it is equivalent to
represent a protein structure either by the CM or by the PE
of the CM. This result is likely due to the binary nature of
the entries of the CM and to the fact that the topology of
the protein chain imposes significant constraints on the
non-zero entries [9].

Contact maps and their principal eigenvectors. - The
contact map C of a protein structure is a binary symmetric
matrix, with elements $C_{ij} = 1$ if the amino acids at
positions i and j are in contact, and $C_{ij} = 0$ otherwise.
Two residues are defined to be in contact if at least one
pair of heavy atoms, one belonging to each amino acid,
is less than 4.5 Å apart. Other contact definitions exist
in the literature, for instance based on a distance threshold
on the $C_\alpha$ atoms [8], but the algorithm presented below
does not depend on the detailed contact condition.
Additionally, only residues separated by at least three
positions along the sequence are considered in contact,
so that $C_{ij} = 0$ if $|i - j| < 3$. In this way, trivial
short-range contacts are not taken into account.
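
The contact definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input format (one array of heavy-atom coordinates per residue), the function name `contact_map`, and the toy coordinates are all assumptions made for the example.

```python
import numpy as np

def contact_map(residue_atoms, cutoff=4.5, min_separation=3):
    """Binary contact map from per-residue heavy-atom coordinates.

    residue_atoms: list of (n_atoms, 3) arrays in Angstrom, one per residue
    (a hypothetical input format chosen for this sketch).
    """
    n = len(residue_atoms)
    C = np.zeros((n, n), dtype=int)
    for i in range(n):
        # pairs with |i - j| < min_separation are never in contact
        for j in range(i + min_separation, n):
            diff = residue_atoms[i][:, None, :] - residue_atoms[j][None, :, :]
            d_min = np.sqrt((diff ** 2).sum(axis=-1)).min()
            if d_min < cutoff:  # at least one heavy-atom pair is close enough
                C[i, j] = C[j, i] = 1
    return C

# Toy "chain" of five residues with made-up coordinates.
toy = [
    np.array([[0.0, 0.0, 0.0]]),
    np.array([[10.0, 0.0, 0.0]]),
    np.array([[1.0, 0.0, 0.0]]),                     # close to residue 0, but |i - j| = 2
    np.array([[0.0, 4.0, 0.0], [0.0, 9.0, 0.0]]),    # first atom is 4.0 A from residue 0
    np.array([[0.0, 0.0, 4.4]]),
]
C_toy = contact_map(toy)
```

In this toy example only the pairs (0, 3) and (0, 4) are in contact; the pair (0, 2) is spatially close but excluded by the sequence-separation rule.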

In what follows, $\lambda$ will denote the principal (i.e.,
largest) eigenvalue of C [10], and v the corresponding
PE. Since C is a real symmetric matrix, its eigenvalues
are real. The principal eigenvalue $\lambda$ has a value between
the average number of contacts per residue,
$\frac{1}{N} \sum_{ij} C_{ij}$, and the maximal number of
contacts of any given residue, $\max_i \big( \sum_j C_{ij} \big)$.
The non-zero components of v all have the same sign, which
we choose to be positive.
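
As a quick numerical check of these statements, the following sketch diagonalizes a small contact map with NumPy and compares the principal eigenvalue with the two bounds. The 8-residue contact list is invented for illustration, and `principal_eigenpair` is a hypothetical helper name.

```python
import numpy as np

def principal_eigenpair(C):
    """Largest eigenvalue of a symmetric matrix and its eigenvector,
    with the sign chosen so that the components are non-negative."""
    w, V = np.linalg.eigh(np.asarray(C, dtype=float))  # eigenvalues in ascending order
    lam, v = w[-1], V[:, -1]
    return lam, (v if v.sum() >= 0 else -v)

# Invented toy contact map: connected graph, all contacts have |i - j| >= 3.
pairs = [(0, 3), (0, 4), (0, 7), (1, 4), (1, 5), (2, 5), (2, 6), (3, 6), (3, 7)]
C = np.zeros((8, 8), dtype=int)
for i, j in pairs:
    C[i, j] = C[j, i] = 1

lam, v = principal_eigenpair(C)
mean_contacts = C.sum() / len(C)     # average number of contacts per residue
max_contacts = C.sum(axis=1).max()   # maximal number of contacts of a residue
```

For this toy map the average and maximal numbers of contacts are 2.25 and 3, and `lam` falls between them, while all components of `v` are positive because the graph is connected.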

FIG. 1: The protein Amylase/Serine protease inhibitor with
PDB id. 1bea. Shown are: (a) the three-dimensional structure
(drawn using MolScript and Raster3D), (b) the CM C
(black means $C_{ij} = 1$ and white means $C_{ij} = 0$), and (c) the
PE v. The reduction from (a) to (b) is done by a distance
threshold, and the one from (b) to (c) by a diagonalization.
The reconstruction from (c) to (b) is described in this Letter,
whereas the one from (b) to (a) is described in Ref. [7].

The PE maximizes the quadratic form $\sum_{ij} C_{ij} v_i v_j$
under the constraint $\sum_i v_i^2 = 1$. In this sense, $v_i$ can
be interpreted as the effective connectivity of position i,
since positions with large $v_i$ are in contact with as many
positions j with large $v_j$ as possible. As we discuss below,
one can show that if the CM represents a single
connected graph, all the structural information is actually
contained in its PE. For proteins consisting of several
distinct domains, the CM becomes a block matrix, and
the PE contains information only on the largest block
(domain).
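
The variational characterization of the PE can be made explicit with a Lagrange multiplier. The short derivation below is the standard textbook argument rather than part of the original text; the multiplier $\mu$ is a symbol introduced here for illustration, and the symmetry of C is used in the derivative.

```latex
\begin{align}
  \mathcal{L}(v, \mu) &= \sum_{ij} C_{ij} v_i v_j
      - \mu \Big( \sum_i v_i^2 - 1 \Big), \\
  \frac{\partial \mathcal{L}}{\partial v_i}
      &= 2 \sum_j C_{ij} v_j - 2 \mu v_i = 0
      \quad \Longrightarrow \quad C v = \mu v, \\
  \sum_{ij} C_{ij} v_i v_j &= \mu \sum_i v_i^2 = \mu .
\end{align}
```

Every stationary point is therefore an eigenvector of C, the quadratic form evaluated there equals the corresponding eigenvalue, and the maximum is attained for $\mu = \lambda$.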

Reconstructing a contact map from its principal eigenvector.
- The reconstruction algorithm is based on imposing
that the matrix C, applied to v, fulfills the secular
equation $C v = \lambda v$. At first glance, this equation does not
seem to be sufficient to determine the CM, as there are
infinitely many matrices $M \neq C$ that fulfill $M v = \lambda v$.
However, additional constraints can be imposed using the
fact that the elements of C assume only the values 0 and
1, and that consequently $\lambda$ as well as all non-zero
components of v are positive. The central idea is to exploit these
additional constraints, as they allow a 'greedy' scheme
to be applied in the search for possible solutions. We first
discuss the algorithm under the hypothesis that all components
of v and the corresponding principal eigenvalue $\lambda$ are
known. We will show later that, for CMs of protein folds,
it is possible to deduce $\lambda$ from the components of the PE,
so that only the PE has to be known. The algorithm
proceeds according to the following steps:

(i) Elements $C_{ij}$ with $|i - j| < 3$ are set to 0. The
remaining elements of C are marked as 'unknown.'

(ii) For all positions i for which the PE vanishes, i.e.
$v_i = 0$, all 'unknown' entries in the i-th row and the
i-th column of C are set to 0, i.e. $C_{ij} = C_{ji} = 0$ for all
j (concerning 'unknown' elements $C_{ij}$ for which both $v_i$
and $v_j$ vanish, see below).

(iii) The non-zero components of v are sorted and treated
recursively in increasing order, starting with the smallest
value. Let i be the position presently examined, with
the aim of determining the i-th row and i-th column of C.
Some elements $C_{ij} = C_{ji}$ with
$j \in \mathcal{J} = \{j_1, j_2, \ldots\}$ are
already known, whereas some other elements $C_{ik} = C_{ki}$
with $k \in \mathcal{K} = \{k_1, k_2, \ldots\}$ are still 'unknown.'
To evaluate the latter, we calculate the sum
$\sum_{j \in \mathcal{J}} C_{ij} v_j$ over the
known elements $\mathcal{J}$, and consider three different
possibilities:

(iii/1) The sum $\sum_{j \in \mathcal{J}} C_{ij} v_j$ is equal to
$\lambda v_i$ up to the chosen precision $\epsilon$, i.e.
$|\sum_{j \in \mathcal{J}} C_{ij} v_j - \lambda v_i| < \epsilon$. Then
one solution for the i-th row and the i-th column of C
has been found, since the i-th line of the secular equation
can be fulfilled by setting to 0 all 'unknown' elements in
the i-th row and the i-th column of C, $C_{ik} = C_{ki} = 0$ for
$k \in \mathcal{K}$, and the recursion returns with success.
(iii/2) Otherwise, if either
$\sum_{j \in \mathcal{J}} C_{ij} v_j > \lambda v_i$, meaning
numerically that
$\sum_{j \in \mathcal{J}} C_{ij} v_j - \lambda v_i > \epsilon$, or if the set of
'unknown' entries $\mathcal{K}$ is empty, the i-th line of the secular
equation cannot be fulfilled with the present set $\mathcal{J}$; the
recursion is in a dead end and returns with failure.
(iii/3) Finally, in the remaining cases, the 'unknown'
elements in the non-empty set $\mathcal{K}$ have to be further
processed. To eliminate candidates which lead to a dead end,
the set of elements $\mathcal{K}$ is sorted by the values $v_k$ in
decreasing order. Starting with the element $k \in \mathcal{K}$ which
has the largest value $v_k$, the sum
$\sum_{j \in \mathcal{J}} C_{ij} v_j + v_k$ is calculated.
There are two possibilities: (a) If
$\sum_{j \in \mathcal{J}} C_{ij} v_j + v_k > \lambda v_i$,
meaning numerically that
$\sum_{j \in \mathcal{J}} C_{ij} v_j + v_k - \lambda v_i > \epsilon$,
then it follows that $C_{ik} = C_{ki} = 0$, since all components
of v are positive. (b) If
$\sum_{j \in \mathcal{J}} C_{ij} v_j + v_k \leq \lambda v_i$,
meaning numerically that
$\sum_{j \in \mathcal{J}} C_{ij} v_j + v_k - \lambda v_i < \epsilon$, then
$C_{ik} = C_{ki}$ is allowed to assume both values 0 and 1, and
the search branches. In both cases (a) and (b), the
algorithm continues recursively until the set
$\mathcal{K} = \mathcal{K} \setminus \{k\}$
of 'unknown' entries is empty, and either a failure or a
success is reported.
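
Steps (i) to (iii) can be sketched as a small backtracking search. The code below is an illustrative reading of the description above, not the authors' implementation: the names `reconstruct`, `zero_tol`, and the toy contact list are assumptions, and the plain recursion is only suitable for short chains, since its depth grows with the chain length.

```python
import numpy as np

def reconstruct(v, lam, eps=1e-6, min_sep=3, zero_tol=1e-12):
    """Enumerate binary symmetric matrices C with C v = lam v (within eps).

    An entry of -1 marks an 'unknown' element. zero_tol (an assumption of
    this sketch) decides which PE components count as vanishing.
    """
    v = np.asarray(v, dtype=float)
    n = len(v)
    idx = np.arange(n)
    C = -np.ones((n, n), dtype=int)
    C[np.abs(idx[:, None] - idx[None, :]) < min_sep] = 0   # step (i)
    zero = np.abs(v) < zero_tol
    C[zero, :] = 0                                         # step (ii)
    C[:, zero] = 0
    order = [i for i in np.argsort(v) if not zero[i]]      # step (iii)
    solutions = []

    def set_pair(i, k, value):
        C[i, k] = C[k, i] = value

    def solve_row(t):
        if t == len(order):                  # every line has been treated
            solutions.append(C.copy())
            return
        i = order[t]
        known_sum = float(np.dot(C[i] == 1, v))
        unknown = sorted(np.flatnonzero(C[i] == -1), key=lambda k: -v[k])
        fill(t, i, known_sum, unknown)

    def fill(t, i, s, unknown):
        target = lam * v[i]
        if abs(s - target) < eps:            # (iii/1): line fulfilled
            for k in unknown:
                set_pair(i, k, 0)
            solve_row(t + 1)
            for k in unknown:
                set_pair(i, k, -1)           # undo before backtracking
            return
        if s - target > eps or not unknown:  # (iii/2): dead end
            return
        k, rest = unknown[0], unknown[1:]    # (iii/3): largest v_k first
        if s + v[k] - target < eps:          # case (b): C_ik = 1 is possible
            set_pair(i, k, 1)
            fill(t, i, s + v[k], rest)
        set_pair(i, k, 0)                    # C_ik = 0 is tried in (a) and (b)
        fill(t, i, s, rest)
        set_pair(i, k, -1)                   # undo before backtracking

    solve_row(0)
    return solutions

# Invented 8-residue toy contact map (connected, contacts with |i - j| >= 3).
pairs = [(0, 3), (0, 4), (0, 7), (1, 4), (1, 5), (2, 5), (2, 6), (3, 6), (3, 7)]
C_true = np.zeros((8, 8), dtype=int)
for i, j in pairs:
    C_true[i, j] = C_true[j, i] = 1
w, V = np.linalg.eigh(C_true.astype(float))
lam, v = w[-1], np.abs(V[:, -1])             # PE of a connected graph is positive
solutions = reconstruct(v, lam)
```

The returned list contains every matrix compatible with the given PE and eigenvalue at precision `eps`. For such a small and fairly symmetric toy map the solution need not be unique, so the sketch only illustrates the search and makes no claim about uniqueness.
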
After examining the i-th line in step (iii), the algorithm
has found all possible binary matrices fulfilling the
i-th line of the secular equation, as well as the lines pre- 
viously treated. For each of these partial solutions, the 
reconstruction continues with the line corresponding to 
the next largest component of v. Thereby, a tree is con- 
structed which grows by another generation t every time 
a new line of the secular equation is evaluated. The leaves 
of the tree at generation t are the partial solutions that 
fulfill the so far treated t lines of the secular equation. 
Some of the leaves of the tree 'die' in the step from
generation $t - 1$ to t, if no solution for the presently treated
line of the secular equation is found that fulfills the
previously treated $t - 1$ lines of the secular equation as well.
The search continues either until all lines of the secu- 
lar equation have been treated, or until all leaves have 
'died.' In the former case, at least one complete solution 
is found, whereas in the latter case the reconstruction 
stops without result. Due to the particular order of the 
search, the exponentially large number of possible CMs 
is explored in a 'greedy' way, limiting the search to the 
smallest possible subset by discarding wrong parts of the 
tree already in an early stage of the search. For most of 
the proteins we have studied (see below), the number of
leaves in any given step did not exceed 10
for a properly chosen value of the threshold $\epsilon$.

Notice that the elements $C_{ij} = C_{ji}$, $|i - j| > 2$, for
which both $v_i$ and $v_j$ vanish are not determined by the
PE. These positions belong to a set which does not
interact with the set of positions contributing to the PE,
either as a completely independent structural domain, or
as an isolated residue without any contact. In step (ii),
we set such values of $C_{ij} = C_{ji}$ to 0 for convenience.
Their actual value can be determined at a later stage,
when the CM obtained through our algorithm is submitted
to a procedure such as the one of Ref. [7], in order
to determine the three-dimensional structure. This second
step yields the three-dimensional structure, satisfying
physical constraints, whose CM is most similar to the
one reconstructed from the PE.

If the principal eigenvalue $\lambda$ is not known, a simple way
to guess its value consists in taking the smallest non-zero
PE component $v_i$, and using as candidate values (i) all
ratios $v_j / v_i$ with non-zero $v_j$ and $|i - j| > 2$, (ii) all
ratios $(v_j + v_k) / v_i$ with non-zero $v_j$ and $v_k$, $k > j$,
$|i - j| > 2$, and $|i - k| > 2$, and (iii) all ratios
$(v_j + v_k + v_l) / v_i$ with non-zero $v_j$, $v_k$, and $v_l$,
$l > k > j$, $|i - j| > 2$, $|i - k| > 2$, and $|i - l| > 2$.
The first choice corresponds to assuming
that position i has only a single contact, the second
corresponds to assuming that position i has two contacts,
whereas the third corresponds to assuming that position
i has three contacts. Values larger than 10 (for CMs based on
a 4.5 Å threshold on the heavy atoms) are discarded. In
this way, one can find a set of 'guesses' of the principal
eigenvalue, which, for CMs of proteins, contains the
correct value. All wrong guesses for the principal eigenvalue
get quickly discarded by the algorithm, since the
search runs into a dead end in these cases.
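
The enumeration of candidate eigenvalues can be sketched as follows. The function name `eigenvalue_candidates`, its parameters, and the toy contact map are assumptions made for illustration; only the rule itself (ratios of sums of up to three components to the smallest non-zero component, capped at 10) is taken from the text.

```python
import numpy as np
from itertools import combinations

def eigenvalue_candidates(v, max_contacts=3, lam_max=10.0, min_sep=3, zero_tol=1e-12):
    """Candidate principal eigenvalues from the smallest non-zero PE component."""
    v = np.asarray(v, dtype=float)
    nonzero = np.flatnonzero(np.abs(v) > zero_tol)
    i = nonzero[np.argmin(v[nonzero])]               # smallest non-zero component
    partners = [j for j in nonzero if abs(j - i) >= min_sep]
    candidates = []
    for m in range(1, max_contacts + 1):             # one, two, or three contacts
        for combo in combinations(partners, m):
            guess = v[list(combo)].sum() / v[i]
            if guess <= lam_max:                     # discard values larger than 10
                candidates.append(guess)
    return sorted(candidates)

# Invented 8-residue toy contact map; every residue has at most three contacts.
pairs = [(0, 3), (0, 4), (0, 7), (1, 4), (1, 5), (2, 5), (2, 6), (3, 6), (3, 7)]
C = np.zeros((8, 8), dtype=int)
for a, b in pairs:
    C[a, b] = C[b, a] = 1
w, V = np.linalg.eigh(C.astype(float))
lam_true, v = w[-1], np.abs(V[:, -1])
guesses = eigenvalue_candidates(v)
```

Because the residue with the smallest component has at most three contacts in this toy map, the true eigenvalue appears among the guesses; each guess would then be passed to the reconstruction, which discards the wrong ones.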

Another important issue concerns the choice of the
numerical threshold $\epsilon$. If the threshold is too small, it may
be that no solution is found, due to the numerical round-off
error on the PE components. If it is too large, the
tree branches too often and the search becomes
unmanageable. In our calculations, values of order $\epsilon = 10^{-6}$
typically represented a good compromise. Alternatively,
as we have done in this study, the value of $\epsilon$ can be
automatically adjusted by letting it slowly increase, starting
with a small value, until a solution is found. Note that
instead of a threshold on the absolute error, it is
possible to apply a threshold on the relative error. When
doing so, for instance the equality
$\sum_{j \in \mathcal{J}} C_{ij} v_j = \lambda v_i$
becomes
$|\sum_{j \in \mathcal{J}} C_{ij} v_j - \lambda v_i| / (\lambda v_i) < \epsilon$.
The latter condition is usually more appropriate if the
entries $v_i$ are quite broadly distributed.
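
A minimal sketch of the two ideas in this paragraph, the relative-error test and the gradual increase of the threshold, is given below. The helper names, the starting value, and the growth factor of 10 are assumptions chosen for illustration; `solver` stands for any function that runs the reconstruction at a given threshold and returns a (possibly empty) list of solutions.

```python
def relative_residual(row_sum, lam, v_i):
    """Relative error |sum_j C_ij v_j - lam v_i| / (lam v_i) of one line."""
    return abs(row_sum - lam * v_i) / (lam * v_i)

def adaptive_threshold(solver, eps_start=1e-10, factor=10.0, eps_max=1e-3):
    """Increase eps geometrically until solver(eps) returns a non-empty list."""
    eps = eps_start
    while eps <= eps_max:
        solutions = solver(eps)
        if solutions:
            return eps, solutions
        eps *= factor
    return None, []          # no solution found up to eps_max

# Stand-in solver for demonstration only: it 'succeeds' once eps reaches 5e-8.
demo_eps, demo_solutions = adaptive_threshold(
    lambda eps: ["CM"] if eps >= 5e-8 else []
)
```

In practice the stand-in lambda would be replaced by a call to the actual reconstruction routine with the PE and eigenvalue held fixed.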

Results. - In order to assess the performance of the
algorithm, we set out to reconstruct the 221 globular
protein structures of PDBselect25 of length $N \leq 120$.
We diagonalized the CMs of these 221 proteins to obtain 
their PE. Then, we applied to each PE the reconstruc- 
tion scheme described above. In 205 cases, the algorithm 
yielded a unique solution, identical to the original na- 
tive CM. Clearly, in these cases it is possible to recover 
the three-dimensional protein structure with the same 
accuracy achieved when starting with the native CM, 
for instance using the scheme of Ref. [7], with a typical
root mean square displacement (RMSD) of around
2 Å. In the remaining 16 cases, the algorithm yielded a
unique solution, yet the reconstructed CM differed from 
the original one in one or several missing contacts, up 
to 8% of the native ones, which were undetermined since 
the corresponding pairs of components of the PE are 0. 
Nevertheless, the reconstructed CMs were very similar 
to the native ones, so that it is still possible to recover 
the three-dimensional protein structure with a high ac- 
curacy, as the RMSD was increased by 10% or less with 
respect to the structure obtained using the native CM. 
In all cases only a single CM was found, and therefore
the PE defines a CM in an essentially unique way; no
false contact was contained in any of the obtained CMs.

It should be noted that the reconstruction of the CMs
can differ considerably in computational expense, the
three most difficult cases in the present set of proteins
being the CMs of PDB id. 1gif_A (N = 115), of PDB id.
1poa (N = 118), and of PDB id. 1bnk_A (N = 120). For
these proteins, there are many almost identical non-zero 
components of the PE, so that our strategy to efficiently 
eliminate dead ends was not very effective, leading to an 
excessive branching of the search. However, even in these 
cases the solution of the secular equation is unique and 
could be found with an extensive search. 
Discussion. - We have shown that the PE uniquely determines
the CM, apart from elements that correspond
to pairs of positions with vanishing PE components and
are hence undetermined. This is a surprising result,
since it means that the remaining $N - 1$ eigenvectors
and eigenvalues of the CM, where N is of order 100 or 
larger, are completely determined by the PE. It can be 
understood by noting that the number of CMs is large
but finite: the total number of $N \times N$ symmetric binary
matrices is $2^{N(N+1)/2}$, whereas the total number of
CMs representing protein structures increases with N as
$\exp(aN)$, since the chain connectivity introduces
correlations among the contacts of neighboring residues.
In contrast, the number of PEs which are non-identical
up to a given precision $\epsilon$ increases with decreasing $\epsilon$ as
$\epsilon^{-N}$ without bound. Therefore, for sufficiently small $\epsilon$,
the number of these distinct PEs is much larger than the 
number of CMs, so that the overwhelming majority of 
them do not correspond to any CM, and we can expect 
that those which do correspond to a CM correspond to 
a unique CM, as our results confirm. We note that this 
is true, however, only for sufficiently accurate PEs. For 
noisy PEs, the exact algorithm presented in this Letter is 
not suitable, and one has to resort to a stochastic search, 
such as for example a Monte Carlo scheme. Such a scheme
is considerably more complicated than the algorithm pre- 
sented here, mainly due to the existence of CMs which 
do not correspond to a physically realizable structure but 
have a PE being almost identical to the one of the target 
CM. Preliminary results suggest that the reconstruction 
of a noisy PE might be possible when constraining the 
search to CMs corresponding to protein-like structures 
with proper secondary structure and steric interaction. 
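
The counting argument of this paragraph can be condensed into one line. This is only a scaling sketch based on the estimates quoted in the text, with the constant $a$ left unspecified as in the original.

```latex
\begin{equation}
  \frac{\#\,\text{distinguishable PEs}}{\#\,\text{protein-like CMs}}
  \;\sim\; \frac{\epsilon^{-N}}{e^{aN}}
  \;=\; \left( \epsilon \, e^{a} \right)^{-N} ,
\end{equation}
```

This ratio grows exponentially with N whenever $\epsilon < e^{-a}$, so within these estimates 'sufficiently small' refers to a precision below $e^{-a}$, and almost all candidate vectors then correspond to no CM at all.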

The PE only contains information on the largest con- 
nected component of the graph representing the protein 
structure. Nodes which are not connected with it have 
vanishing PE components, and their mutual links are un- 
determined. Therefore, our scheme is only able to de- 
termine the CM for the subset of positions which have 
non-zero PE components. For single-domain globular 
proteins, they represent the great majority of the po- 
sitions in the chain, with the only exceptions of small 
loops or single residues completely exposed to the sol- 
vent. For non-globular proteins, whose CM is sparse and 
whose number of contacts per residue is hence smaller 
than a threshold, the connected positions are few and 
the method is able to yield only a portion of the CM. 
However, such structures are problematic in any case,
since their CM does not determine a well defined
three-dimensional structure, and they are not thermodynamically
stable in the absence of other protein chains and other
molecules with which they interact (there are 27 such protein
structures in PDBselect25 of length $N \leq 120$, which
we omitted in our study). A more important class of difficult
cases is that of structures consisting of several almost
independent domains. For such structures, the PE components are
very small outside the principal domain, although they 
are not zero, since there is always a small number of 
inter-domain contacts. 

In conclusion, we have presented an exact algorithm
that allows the CM of a protein structure to be
reconstructed from the sole knowledge of its PE. The resulting
CM can then be used to reconstruct the full
three-dimensional structure, for instance using the scheme of
Ref. [7]. In this sense, the three-dimensional structure
of a protein fold can be reduced ('compressed') into the 
PE of its CM, from which it can be recovered ('decom- 
pressed') with no or minimal information loss. We have 
applied the algorithm to the set of 221 globular proteins 
of PDBselect25 of length $N \leq 120$ and obtained in all
cases a unique CM from the PE. In terms of structure 
representation, our results show that a CM and its PE 
are equivalent, which leads to a significant simplification 
in the representation of protein folds. We anticipate that 
this result will create new possibilities in bioinformat- 
ics applications, in particular those involving alignments 
of structure to structure and of sequence to structure. 
Furthermore, our results have important implications for
our understanding of protein evolution. We have in fact
shown that the PE is correlated with the hydrophobicity
profile of the amino acid sequence attaining the fold,
and even more correlated with the hydrophobicity profile
averaged over many sequences attaining the fold, so
that protein evolution can be understood as the motion
of the hydrophobicity profile around the PE of the fold
[12]. Finally, since the PE is related to the contact
vector, which it seems possible to predict from
the protein sequence, a coarse prediction of the PE
may be possible as well, and thus the PE may become an
effective tool in protein structure prediction approaches.